Python correct encoding of Website (Beautiful Soup)


2023-03-22 06:58 | Source: web aggregation | Views: 265

You are making two mistakes; you are mis-handling encoding, and you are treating a result list as something that can safely be converted to a string without loss of information.

First of all, don't use response.text! It is not BeautifulSoup that is at fault here; you are re-encoding a Mojibake. The requests library defaults to Latin-1 encoding for text/* content types when the server doesn't explicitly specify an encoding, because the HTTP standard states that that is the default.

See the Encoding section of the Advanced documentation:

The only time Requests will not do this is if no explicit charset is present in the HTTP headers and the Content-Type header contains text. In this situation, RFC 2616 specifies that the default charset must be ISO-8859-1. Requests follows the specification in this case. If you require a different encoding, you can manually set the Response.encoding property, or use the raw Response.content.

Bold emphasis mine.
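To see what that default does to a UTF-8 page, here is a minimal sketch (the byte string is a made-up stand-in for a server response body):

```python
# What the server actually sent: UTF-8 encoded text.
raw = "Héllo wörld".encode("utf-8")

# With no charset in the Content-Type header, requests decodes text/* bodies
# as ISO-8859-1, so response.text hands you a Mojibake:
mojibake = raw.decode("iso-8859-1")
print(mojibake)  # HÃ©llo wÃ¶rld

# Decoding the raw bytes with the real encoding recovers the original text:
print(raw.decode("utf-8"))  # Héllo wörld
```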

Pass in the response.content raw data instead:

soup = BeautifulSoup(r.content, 'html.parser')

I see that you are using BeautifulSoup 3. You really want to upgrade to BeautifulSoup 4 instead; version 3 was discontinued in 2012 and contains several bugs. Install the beautifulsoup4 project, and use from bs4 import BeautifulSoup.

BeautifulSoup 4 usually does a great job of figuring out the right encoding to use when parsing, either from an HTML tag or by statistical analysis of the bytes provided. If the server does provide a character set, you can still pass this into BeautifulSoup from the response, but do test first whether requests used a default:

encoding = r.encoding if 'charset' in r.headers.get('content-type', '').lower() else None
parser = 'html.parser'  # or lxml or html5lib
soup = BeautifulSoup(r.content, parser, from_encoding=encoding)
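The header check on its own can be exercised without a live request; a small sketch with hypothetical header values:

```python
# Trust requests' detected encoding only when the server sent an explicit
# charset; otherwise return None so BeautifulSoup does its own detection.
def declared_encoding(content_type, requests_encoding):
    return requests_encoding if 'charset' in content_type.lower() else None

# Server declared a charset: pass it through to BeautifulSoup.
print(declared_encoding('text/html; charset=utf-8', 'utf-8'))  # utf-8

# No charset: requests fell back to ISO-8859-1, so ignore it.
print(declared_encoding('text/html', 'ISO-8859-1'))  # None
```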

Last but not least, with BeautifulSoup 4, you can extract all text from a page using soup.get_text():

text = soup.get_text()
print(text)

You are instead converting a result list (the return value of soup.findAll()) to a string. This can never work, because containers in Python use repr() on each element in the list to produce a debugging string, and for strings that means you get escape sequences for anything that is not a printable ASCII character.
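A quick illustration of that repr() behaviour, using byte strings (which behave here like Python 2 str objects did) as a hypothetical stand-in for the findAll() result:

```python
# A stand-in for a result list whose elements hold encoded, non-ASCII text.
results = ["café".encode("utf-8"), "naïve".encode("utf-8")]

# Converting the container to a string calls repr() on every element,
# which escapes anything outside printable ASCII:
print(str(results))  # [b'caf\xc3\xa9', b'na\xc3\xafve']

# Decoding each element first keeps the characters intact:
print(", ".join(b.decode("utf-8") for b in results))  # café, naïve
```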


